Reduce RAM usage, fix VRAM OOMs, and fix Windows shared memory spilling with adaptive model loading #11845
As a 16GB VRAM user running the LTX 2 model, the main issue for me currently is that before VAE decoding occurs, the whole model gets offloaded to RAM, which is already loaded with the TEs/VAE/latent upscalers, etc., so it gets overloaded and spills onto the pagefile. In reality the max VRAM use of the VAE decoding step is 4GB with the "VAE Decode (Tiled)" node, so unloading the whole model (probably because ComfyUI's VAE estimations don't account for tiled decoding) is the biggest issue I have found. It is the only reason us 16GB VRAM / 32GB RAM users see the TE reload from disk (because it got unloaded to make space for the model) after changing the prompt, and the model reload again; both of these contribute to huge slowdowns. |
This loader doesn't unload back to RAM at all, so it won't spill to the pagefile. The idea is: if you don't have enough RAM, just dump it, because it's faster to read it from the file on disk again than to write to and read from the pagefile. If you do have enough RAM, the OS will just leave the model in the disk cache from the first load. So this should be faster for you. Your margins are very low, so you might do well with --disable-pinned-memory, but if you try it, try it both ways. Kudos for running LTX2 on 16GB; these performance points are what I'm really trying to make work here. |
Loading the model file from disk again does seem to be a nice way to prevent useless writing to the pagefile, at least. It will be very useful for 16GB RAM users. I use the model without any changes to startup args, with GGUFs. It works perfectly; the only issue currently is offloading the whole model to RAM and pagefile just to make space for 4GB haha |
|
The ComfyUI-ReservedVRAM node already lets ComfyUI run any model in any amount of VRAM. LTX2 can run in 1G of VRAM. Maybe you should take a look. |
|
Awesome, I hope this gets implemented |
I am looking at the screenshots of this node: where does the model offload to? The RAM use doesn't increase even with the model only using 6GB of VRAM, so what's the caveat? Edit: NVM, it offloads to RAM |
This PR doesn't do any offloading to RAM like the node you mentioned. This PR just drops the model if enough RAM space is not found and loads the files for each run (I think) using a faster method. TL;DR: it prevents useless writing to the pagefile |
|
1. Large RAM savings. In your pic, it's saving about 1/2 😲 |
|
I tested some results in Qwen Edit using GGUF. Speed is the same, RAM almost the same, VRAM seems more stable than before. LTX2 GGUF is very bad: RAM use is higher, speed is 1/4, VRAM is full, and reserve-vram is useless now. Not good |
|
@zwukong note, gguf is not officially supported in ComfyUI and requires the use of a custom node pack, which at this moment does not account for anything changed in this PR since it's so new. For testing, please don't use gguf models at this time! |
That is practically similar to what I usually do: I use normalvram to minimize both RAM & VRAM usage. Unlike lowvram, which forcefully stores the model in RAM after use, or highvram, which forcefully keeps the model in VRAM after use, normalvram will free the model after use when there is not enough free memory, thus avoiding swap file usage. |
|
GGUF should be your priority, I think. Only 10% have a 4090/5090; 90% use GGUF with under 20G of VRAM. Even kj uses GGUF too, and he has a 5090 |
Don't think this flag does anything these days: ^^ No users in code search. lowvram semantics have been the default for a while, as models got too big to assume that users could fit them in VRAM by default. |
I certainly do not use GGUF on a 5090 or even a 4090; that would be missing out on fp8 matmuls (the fast mode). The common misconception seems to be that you have to use GGUF for low VRAM systems, which isn't true when offloading exists and is rather effective in ComfyUI currently. This PR would make it work even better. Of course, on low RAM systems there are fewer options for offloading and GGUF becomes useful. |
|
Yes, mainly for low RAM, and it's less than 1/4 the size. FP8 or FP16 eats too much RAM, at least twice as much. So if your text encoder and unet are both fp8, that's four times the size. And right now RAM and SSDs are more expensive than video cards 😄 |
|
comfy_aimdo seems to break ROCm even when this is not in use. It tries to dynamically load libcuda.so.1, which does not exist. |
|
Are there plans to open-source comfy-aimdo? |
|
comfy-aimdo will be open sourced before this pull request is merged. For gguf we will try not to break it but we are focusing on improving our own native quant system to make it better/faster than gguf. |
Any plans of converting existing models to NVFP4? I tried converting FLUX SRPO and managed it, but quality dropped sharply |
|
@comfyanonymous I know it is fp4, but 40-series cards can't run it. We need INT4 as well, like Nunchaku does. The reason GGUF is the best for now is quality and size; even Q2 can get pretty good results, and Q3 and above are almost the same as fp16 |
|
There is a PR for int4. This PR is for memory management improvements with comfy-aimdo; it would be appreciated if this thread were kept for testing this PR and not feature requests |
|
Not just feature requests: when I tested this PR, GGUF could not benefit. So I want GGUF to be supported too. Most of us use the great GGUFs. Almost all my models (about 99%) are GGUF |
|
Would the Windows shared memory avoidance stuff have any effect when using WSL? If not: with the changes now maximising VRAM usage, I've noticed some slowdowns/stalls on subsequent runs, along with the normal signs of shared memory sluggishness (low temperatures, low power draw, 100% GPU usage, and reported shared memory usage in Task Manager) when testing out the PR, and I wonder if the changes might not be suited for WSL as-is. |
By default, Windows will only allow 1/2 of your system memory to be used by WSL (without modifying .wslconfig with memory=24GB or whatever you want to set it to), so you're already going to run into issues quickly; see the snippet below. But as far as I know, ComfyUI should pick up on that value, and if it does, then any other memory management math should pick up on it as well. Though at the GPU driver level, I'm not sure how they handle shared memory when used with WSL. |
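For reference, raising the WSL memory cap as described above would look roughly like this in %UserProfile%\.wslconfig (the 24GB figure is just the example value from the comment; pick whatever fits your system):

```ini
# %UserProfile%\.wslconfig -- example only
[wsl2]
memory=24GB
```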
You are right that I ignore --reserve-vram for the moment. It can be implemented with a bit of plumbing and I'll take it as a feature request (along with --novram), but we might not do that one in V1 as you can just opt out in the interim.

Yeah, so WSL is actually a big problem and very difficult (maybe impossible) to fix with regards to shared memory spilling. When you are under WSL you will present as Linux to aimdo, which won't have its anti-spill in play, since that is Windows specific. Even if we could detect WSL, we would not have access to the APIs needed to detect the spill, as they are only visible on the host Windows. WSL has value from a Linux familiarity point of view and solves some software packaging problems, but unfortunately the extra layer of indirection between comfy and the GPU creates multiple performance problems. If you're optimizing comfy performance and like the Linux env, I VERY strongly recommend a dual boot setup, as I have observed major performance differences in offloading setups where Linux just beats Windows with all other variables held the same (I dual boot my day-to-day test machine between Ubuntu and Win11). |
Sync before deleting anything.
This is needed for aimdo, where the cache can't self-recover from fragmentation. It is, however, a good thing to do anyway after an OOM, so make it unconditional.
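A minimal sketch of that ordering (illustrative only, not the PR's actual code): synchronize first, then drop references, then release the allocator cache.

```python
import torch

def free_after_oom(loaded_models: list):
    # Drain any queued kernels that may still reference the weights we are
    # about to drop, before anything is deleted.
    torch.cuda.synchronize()
    loaded_models.clear()      # drop the references so the VRAM can actually be freed
    torch.cuda.empty_cache()   # hand the cached blocks back to the driver
```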
Be more tolerant of unsupported platforms and fall back properly. Fixes a crash when cuda is not installed at all.
If running on Windows, defer creation of the layer parameters until the state dict is loaded. This avoids a massive spike in Windows commit charge when a model is created but not loaded. This problem doesn't exist on Linux, as Linux allows RAM overcommit; Windows does not. Before the dynamic memory work this was also a non-issue, as every non-quant model would immediately RAM load and need the memory anyway. Make the workaround Windows specific, as there may be someone out there with some train-from-scratch workflow (which this might break), and assume said someone is on Linux.
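As a rough illustration of the deferred-parameter idea (an assumed sketch, not the PR's code): building the layer on the meta device costs no RAM or commit charge, and the real tensors only appear when the state dict is assigned in.

```python
import torch
import torch.nn as nn

def make_deferred_linear(in_features: int, out_features: int, defer: bool = True):
    if defer:
        # Meta tensors carry only shape/dtype metadata, so no commit charge is
        # taken for the placeholder weight at construction time.
        with torch.device("meta"):
            return nn.Linear(in_features, out_features, bias=False)
    return nn.Linear(in_features, out_features, bias=False)

layer = make_deferred_linear(4096, 4096)
# Later, when the real weights arrive (e.g. from an mmap'd file), assign them in;
# assign=True materializes the parameter without an extra copy.
state_dict = {"weight": torch.zeros(4096, 4096, dtype=torch.float16)}
layer.load_state_dict(state_dict, assign=True)
```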
The CoW mmap as used by safetensors is hardcoded to CoW, which forcibly consumes Windows commit charge on a zero copy. RIP. Implement safetensors loading in pytorch itself with a READ mmap so we don't get commit-charged for all our open models.
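A hedged sketch of the READ-mmap load (names and the dtype table are illustrative; the real implementation lives in comfy/utils.py): the whole file is mapped read-only, so Windows never takes a commit charge for it, and tensors are just views into the mapping.

```python
import json
import mmap
import struct

import torch

def load_safetensors_readonly(path):
    # Standard safetensors layout: 8-byte little-endian header length,
    # JSON header, then the raw tensor data area (assumed to be aligned as usual).
    with open(path, "rb") as f:
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # READ, not CoW
    header_size = struct.unpack("<Q", mapping[:8])[0]
    header = json.loads(mapping[8:8 + header_size])
    # This is what triggers the "buffer is not writable" warning seen in the logs.
    data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]

    dtypes = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        start, end = info["data_offsets"]
        # Views into the mapping: nothing is copied, nothing is committed.
        tensors[name] = data_area[start:end].view(dtypes[info["dtype"]]).view(info["shape"])
    return tensors
```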
This isn't worth it, and the likelihood of inference leaving a complex data structure with a cyclic reference behind is low. Remove it. We would replace it with a condition on nodes that actually touch the GPU, which might be a win.
This is phase 2
This is needed for deepcopy construction. We shouldn't really have deep copies of MP or MODynamic; however, this is a stray one in some controlnet flows.
Force-pushed from 549ce2d to aef8d00
|
rebased to 0fc1570 (v0.10.0 +9) |
Now that the model-defined dtype is decoupled from the state_dict dtypes, we need to be able to handle worst-case casts between the SD and the VBAR.
Scan created models and save off the dtypes as defined by the model creation process. This is needed for assign=True, which will override the dtypes.
If the model defines a dtype that is different to what is in the state dict, respect that at load time. This is done as part of the casting process.
Thanks for the test. Can you confirm the version of the PR you tried (just type "git show")? I'm making changes every day as things come in, and if I can associate this data with a specific revision that helps. Can I get your PCIe bus width and generation? I am very, very interested in your data if you do exactly the same setup with --disable-pinned-memory, both for your memory numbers and execution times.

The longer story: your RAM consumption as reported by nvtop is usually nothing to worry about, as it's measuring utilization as opposed to committed memory. Committed memory exhaustion is the one that OOMs and crashes systems. Open Task Manager and have a look at the memory page and you will see the "Committed" number. This should be lower with the PR. In this PR the model remains in RAM, but as a soft uncommitted allocation which Windows will automatically free if the system comes under RAM pressure (i.e. it's not committed). Because you just load and use the same big model 4 times, this just flatlines on the peak, which is fine. The pinned memory is, however, fully committed and a separate allocation. So if you have the RAM space it will keep around both the pinned copy and the original copy of the model, and nvitop will count both. |
|
@rattus128 On the previous test, it was on commit 96e5d45. Anyway, here is a new test run with the latest changes, 2d96b2f. Sorry if it's messy lol.
1. Without dynamic_vram
Logs
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static
Import times for custom nodes:
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials
Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.459s (created=0, skipped_existing=1689, total_seen=1692)
Starting server
To see the GUI go to: http://127.0.0.1:8188
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
loaded completely; 2963.00 MB usable, 160.31 MB loaded, full load: True
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded partially; 5631.80 MB usable, 5294.59 MB loaded, 2968.76 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
loaded partially; 5465.60 MB usable, 5129.60 MB loaded, 3133.51 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
loaded partially; 1090.13 MB usable, 226.02 MB loaded, 17090.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00, 4.15s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 50.00 seconds
got prompt
loaded partially; 1088.13 MB usable, 224.02 MB loaded, 17092.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:17<00:00, 4.26s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 23.51 seconds
Peak Committed Memory
Second run
2. With dynamic_vram & pinned memory enabled.
Logs
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
DynamicVRAM support detected and enabled
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static
Import times for custom nodes:
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials
Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.433s (created=0, skipped_existing=1689, total_seen=1692)
Starting server
To see the GUI go to: http://127.0.0.1:8188
got prompt
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\comfy\utils.py:94: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_new.cpp:1587.)
data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\torch\nn\functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = struct c10::BFloat16, Cannot dispatch to fused implementation. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\layer_norm.cpp:347.)
return torch.rms_norm(input, normalized_shape, weight, eps)
0 models unloaded.
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:25<00:00, 6.27s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 60.80 seconds
got prompt
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:27<00:00, 6.82s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 26.64 seconds
The peak committed memory and the second run are similar, both stayed at around 47 GB.
3. With dynamic_vram &
|
Test Evidence Check |
Not sure if there's a better place to raise this but this bot has commented on most (every?) PR with this message even if there are no issues in the description, meaning it's quite a lot of noise (an email from GitHub per subscribed PR). |
Thanks for this data. It's definitely worth looking into and I am tracking it here: rattus128#2 I'll look into it when I get a chance. I have a system pretty similar to yours (3060 + 64GB), so hopefully I can cleanly reproduce it. |














To try it:
NOTE: This work does not have any GGUF integration and GGUF will not see any benefits yet.
NOTE: I am aware of increased Windows RAM usage when not configuring a pagefile, due to commit quota exhaustion. If anyone is testing, please stay tuned for a major fix to Windows RAM usage incoming. The VRAM stuff on Windows is still testable. Linux is unaffected. (FIXED)
If you try it, please reply to the PR (if it hasn't been merged) with any issues, or feel free to make an issue ticket for bigger test cases with logs and numbers.
Features
A new ModelPatcher implementation that backs onto comfy-aimdo to implement varying model load levels that can be adjusted during model use. The patcher defers all load processes, lazily loading the model during use (e.g. the first step of a ksampler), and automatically negotiates a load level during inference to maximize VRAM usage without OOMing. If inference requires more VRAM than is available, weights are offloaded to make space before the OOM happens.
This will eventually allow for development of ComfyUI without needing to estimate model VRAM usage at all.
Large RAM and Windows commit-charge savings. There is no need to load models fully to RAM. This also gives a much higher chance of having a model in the disk cache, saving the user from a disk load delay on first run, as there is no longer a primary load into process memory displacing the disk cache.
Windows GPU shared memory usage avoidance
A deep copy of the model is eliminated in the safetensors save process (incidental improvement)
Reduced VRAM usage in the async offload stream when cuda malloc is disabled (prerequisite improvement)
Implementation Details
Aimdo readme here: https://pypi.org/project/comfy-aimdo/
The long story on RAM: Aimdo's ability to just evict weights means it's no longer possible to .to() a weight back and forth from the GPU. VRAM pressure can occur at any time during inference, and there is no clean way to .to() weights or modules back to the CPU while pytorch is stacked in the middle of a pending VRAM allocation. So, since we can never .to() a weight, we instead take the opportunity to leave the model parameter as known to pytorch on the CPU permanently, with assign=True state dict loading. Since it is never write-touched, it lives in mmap permanently and never consumes any process-allocated RAM. Several community developers have already flagged this as a possible major enhancement to comfy, and the needed changes to model load and unload align with the VRAM problems.
(NEW) Windows has extra RAM complications with its pessimistic allocation and how it forbids overcommit other than via the pagefile. Two changes are made to drastically reduce commit charge. Linear nn.Modules are now constructed without the placeholder weight, as this consumes commit charge. The other change is a lightweight safetensors load that maps files in READ mode (the safetensors package uses CoW), which avoids getting commit-charged for the whole model on file load.
As for loading the weight onto the GPU, that happens via comfy_cast_weights, which is now used in all cases. cast_bias_weight checks whether the VBAR assigned to the model has space for the weight (based on the same load-priority semantics as the original ModelPatcher). If it does, the VRAM returned by the Aimdo allocator is used as the parameter GPU-side. The caster is responsible for populating the weight data. This is done using the usual offload_stream (which means we now have asynchronous loads overlapping first-use compute).
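A minimal sketch of that async cast path under assumed names (offload_stream here is just a plain side stream, not the PR's object): the H2D copy is issued on the side stream and the compute stream waits on it before first use.

```python
import torch

offload_stream = torch.cuda.Stream()  # side stream dedicated to weight transfers

def cast_weight_async(cpu_weight: torch.Tensor) -> torch.Tensor:
    # True overlap needs a pinned source; mmap-backed weights would be staged
    # through a pinned cast buffer in a real implementation.
    with torch.cuda.stream(offload_stream):
        gpu_weight = cpu_weight.to("cuda", non_blocking=True)
    # The compute stream must not touch the weight before the copy lands.
    torch.cuda.current_stream().wait_stream(offload_stream)
    gpu_weight.record_stream(torch.cuda.current_stream())
    return gpu_weight
```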
Pinning works a little differently. When a weight is detected during load as unable to fit, a pin is allocated at the time of casting, and the weight as used by the layer is DMA'd back to the pin using the GPU DMA TX engine, also on the asynchronous offload streams. This means you get to pin the Lora-modified and requantized weights, which can be a major speedup for offload+quantize+lora use cases. This works around the JIT Lora + FP8 exclusion and brings FP8MM to heavy offloading users (who probably really need it with more modest GPUs). There is a performance risk in that a CPU+RAM patch has been replaced with a GPU+RAM patch, but my initial performance results look good. Most users are likely to have a GPU that outruns their CPU in these woods.
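Roughly, the pin-on-the-way-down idea looks like this (a sketch under assumed names, not the PR's helper): the already patched/requantized GPU weight is DMA'd into a freshly allocated pinned host buffer on the offload stream.

```python
import torch

offload_stream = torch.cuda.Stream()

def pin_patched_weight(gpu_weight: torch.Tensor) -> torch.Tensor:
    # Page-locked host buffer: the GPU copy (DMA TX) engine can write into it
    # while compute keeps running on the default stream.
    pin = torch.empty(gpu_weight.shape, dtype=gpu_weight.dtype,
                      device="cpu", pin_memory=True)
    with torch.cuda.stream(offload_stream):
        pin.copy_(gpu_weight, non_blocking=True)
    # The caller must keep gpu_weight alive (e.g. sync or event-wait on
    # offload_stream) until the copy completes before reusing its VRAM.
    return pin
```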
Some common code is written to consolidate a layer's tensors for aimdo mapping, pinning, and DMA transfers. interpret_gathered_like() allows unpacking a raw buffer as a set of tensors. This is used consistently to bundle weights, quantization metadata (QuantizedTensor bits), and biases into one payload for DMA in the load process, reducing Cuda overhead a little. Some quantization metadata was missing async offload in some cases, which is now added. This also pins quantization metadata and consolidates the number of cuda_host_register calls (which can be expensive).
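The gathering idea, approximated (interpret_gathered_like() is the PR's helper; this standalone sketch only illustrates the packing): a layer's weight, bias, and quant metadata are packed into one pinned buffer so a single registration/DMA covers all of them, with per-tensor views carved back out.

```python
import torch

def gather_layer_tensors(tensors, align: int = 16):
    # Compute aligned offsets so each slice can be reinterpreted as its dtype.
    offsets, offset = [], 0
    for t in tensors:
        offset = (offset + align - 1) // align * align
        offsets.append(offset)
        offset += t.numel() * t.element_size()

    buf = torch.empty(offset, dtype=torch.uint8, pin_memory=True)  # one pinned allocation
    views = []
    for t, off in zip(tensors, offsets):
        nbytes = t.numel() * t.element_size()
        view = buf[off:off + nbytes].view(t.dtype).view(t.shape)
        view.copy_(t)        # pack the original data into the shared buffer
        views.append(view)
    # One buffer to register/transfer, per-tensor views for the layer to keep using.
    return buf, views
```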
Model saving is reworked to avoid the force_cast_weights flag, which doesn't make sense in ModelPatcherDynamic. This rework was able to cut a RAM copy of the model by doing on-the-fly model patching during the save process, which worked out to be a nice RAM saving while fixing my API problem.
Aimdo (under the hood) links with Windows APIs to adjust load levels based on the WDDM target VRAM usage rather than the numbers reported by the pytorch/Cuda stack (which are WDDM's lies). This means that as soon as shared memory spilling occurs on Windows, weights will be unloaded until you get out of the spill state, and inference state will move back to VRAM.
Offload streams now have an accompanying single shared cast buffer that grows as needed. This avoids significant waste and fragmentation in the cast-buffer streams when offloading multiple weight sizes, as we don't have cuda_malloc and the pytorch allocator completely isolates memory by stream. So we go a little hands-on at the low level to keep those allocation pools minimized. This is also applied to the non --dynamic_vram case when not using cuda_malloc, as it does reduce VRAM, especially on flux2 with those huge and varying weights.
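A sketch of the grow-only cast buffer idea (assumed structure, not the PR's class): one buffer per offload stream that only ever grows to the largest weight seen, with slices handed out for each cast, so the per-stream allocator pool stops accumulating one block per weight size.

```python
import torch

class GrowOnlyCastBuffer:
    """One reusable staging buffer per offload stream."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self.buf = None

    def get(self, nbytes: int) -> torch.Tensor:
        # Reallocate only when a bigger weight shows up; otherwise hand back a
        # slice of the existing block, avoiding per-size fragmentation.
        if self.buf is None or self.buf.numel() < nbytes:
            self.buf = torch.empty(nbytes, dtype=torch.uint8, device=self.device)
        return self.buf[:nbytes]
```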
Future Work
- The progress meter needs some work. It's jarring to have it stall on the first iteration when it's doing a slow model load. (DONE)
Example Test case:
Flux2 + Lora text to image.
RTX 5090 with 8GB of VRAM consumed by a non-comfy application (24GB effective)
PCIE5 NVME, 96GB RAM.
Disk caches warm with model
Before:
General Memory Usage
Peak VRAM:
After (--fast dynamic_vram)
General Memory usage:
Peak VRAM:
More test data to come. Most workflows I have run are faster with this.
I'm testing various things, updates, bugfixes, etc., but enough works for a PR.